The dataset for this exercise contains 113,937 loans with 81 variables on each loan. The details of each variable will be shown below. The idea behind this data analysis is to show if there are any correlations between the variables which can help in showing lenders' behaviours and loan applications. Questions like what factors affects a loan's outcome status, What affects the borrower’s APR or interest rate will be answered with the use of univariate, bivariate and multivariate exploratory data means. That is, after some preliminary wrangling, of course. The key for the data dictonary is shown below
Rubric Tip: Your code should not generate any errors, and should use functions, loops where possible to reduce repetitive code. Prefer to use functions to reuse code statements.
Rubric Tip: Document your approach and findings in markdown cells. Use comments and docstrings in code cells to document the code functionality.
Rubric Tip: Markup cells should have headers and text that organize your thoughts, findings, and what you plan on investigating next.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
loans = pd.read_csv('prosperLoanData.csv')
loans
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | NaN | 36 | Current | NaN | 0.22354 | 0.1864 | 0.1764 | ... | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 113933 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | NaN | 36 | FinalPaymentInProgress | NaN | 0.13220 | 0.1110 | 0.1010 | ... | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 113934 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | NaN | 60 | Current | NaN | 0.23984 | 0.2150 | 0.2050 | ... | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 113935 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | ... | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
| 113936 | E6ED3600409833199F711B7 | 1140093 | 2014-01-15 09:27:37.657000000 | NaN | 36 | Current | NaN | 0.13189 | 0.1039 | 0.0939 | ... | -1.70 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
113937 rows × 81 columns
loans.describe()
| ListingNumber | Term | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113937.000000 | 113912.000000 | 113937.000000 | 113937.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | ... | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 |
| mean | 6.278857e+05 | 40.830248 | 0.218828 | 0.192764 | 0.182701 | 0.168661 | 0.080306 | 0.096068 | 4.072243 | 5.950067 | ... | -54.725641 | -14.242698 | 700.446342 | 681.420499 | 25.142686 | 0.998584 | 0.048027 | 0.023460 | 16.550751 | 80.475228 |
| std | 3.280762e+05 | 10.436212 | 0.080364 | 0.074818 | 0.074516 | 0.068467 | 0.046764 | 0.030403 | 1.673227 | 2.376501 | ... | 60.675425 | 109.232758 | 2388.513831 | 2357.167068 | 275.657937 | 0.017919 | 0.332353 | 0.232412 | 294.545422 | 103.239020 |
| min | 4.000000e+00 | 12.000000 | 0.006530 | 0.000000 | -0.010000 | -0.182700 | 0.004900 | -0.182700 | 1.000000 | 1.000000 | ... | -664.870000 | -9274.750000 | -94.200000 | -954.550000 | 0.000000 | 0.700000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 4.009190e+05 | 36.000000 | 0.156290 | 0.134000 | 0.124200 | 0.115670 | 0.042400 | 0.074080 | 3.000000 | 4.000000 | ... | -73.180000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 6.005540e+05 | 36.000000 | 0.209760 | 0.184000 | 0.173000 | 0.161500 | 0.072400 | 0.091700 | 4.000000 | 6.000000 | ... | -34.440000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 44.000000 |
| 75% | 8.926340e+05 | 36.000000 | 0.283810 | 0.250000 | 0.240000 | 0.224300 | 0.112000 | 0.116600 | 5.000000 | 8.000000 | ... | -13.920000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 115.000000 |
| max | 1.255725e+06 | 60.000000 | 0.512290 | 0.497500 | 0.492500 | 0.319900 | 0.366000 | 0.283700 | 7.000000 | 11.000000 | ... | 32.060000 | 0.000000 | 25000.000000 | 25000.000000 | 21117.900000 | 1.012500 | 39.000000 | 33.000000 | 25000.000000 | 1189.000000 |
8 rows × 61 columns
loans.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
loans_copy = loans.copy()
loans_copy.head()
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
loans_copy.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
# check for duplicates
loans_copy[loans_copy.duplicated()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|
0 rows × 81 columns
loans_copy.Term.value_counts()
36 87778 60 24545 12 1614 Name: Term, dtype: int64
loans_copy.Term.hist()
<AxesSubplot:>
# Since there are only 3 loan term periods, for clarity, we will change the periods into short term, medium term and long term
# and then convert the data type to categorical
loans_copy.Term = loans_copy.Term.map({12:'Short Term', 36:'Medium Term', 60:'Long Term'})
loans_copy.Term.value_counts()
Medium Term 87778 Long Term 24545 Short Term 1614 Name: Term, dtype: int64
loans_copy.Term.hist()
<AxesSubplot:>
# Change data type of loan term to category type
loans_copy.Term = loans_copy.Term.astype('category')
loans_copy.LoanStatus.value_counts()
Current 56576 Completed 38074 Chargedoff 11992 Defaulted 5018 Past Due (1-15 days) 806 Past Due (31-60 days) 363 Past Due (61-90 days) 313 Past Due (91-120 days) 304 Past Due (16-30 days) 265 FinalPaymentInProgress 205 Past Due (>120 days) 16 Cancelled 5 Name: LoanStatus, dtype: int64
loans_copy.LoanStatus.hist()
plt.xticks(rotation=90);
# Keep all past due loans under one heading for easier viewing
loans_copy.LoanStatus = loans_copy.LoanStatus.replace({'Past Due (1-15 days)':'Past Due', 'Past Due (16-30 days)':'Past Due',
'Past Due (31-60 days)':'Past Due', 'Past Due (61-90 days)' :'Past Due','Past Due (91-120 days)' :'Past Due',
'Past Due (>120 days)' :'Past Due'})
loans_copy.LoanStatus.value_counts()
Current 56576 Completed 38074 Chargedoff 11992 Defaulted 5018 Past Due 2067 FinalPaymentInProgress 205 Cancelled 5 Name: LoanStatus, dtype: int64
# Change data type of LoanStatus to category
loans_copy.LoanStatus = loans_copy.LoanStatus.astype('category')
loans_copy.LoanStatus.hist()
plt.xticks(rotation=90);
loans_copy.CreditGrade.value_counts()
C 5649 D 5153 B 4389 AA 3509 HR 3508 A 3315 E 3289 NC 141 Name: CreditGrade, dtype: int64
# Change datatype to category
loans_copy.CreditGrade = loans_copy.CreditGrade.astype('category')
sns.countplot(data=loans_copy, x='CreditGrade')
<AxesSubplot:xlabel='CreditGrade', ylabel='count'>
loans_copy.BorrowerRate.value_counts()
0.3177 3672
0.3500 1905
0.3199 1651
0.2900 1508
0.2699 1319
...
0.2201 1
0.0752 1
0.1416 1
0.2812 1
0.0739 1
Name: BorrowerRate, Length: 2294, dtype: int64
# Change borrower rate to float data type
loans_copy.BorrowerRate = loans_copy.BorrowerRate.astype('float')
loans_copy.BorrowerRate.describe()
count 113937.000000 mean 0.192764 std 0.074818 min 0.000000 25% 0.134000 50% 0.184000 75% 0.250000 max 0.497500 Name: BorrowerRate, dtype: float64
sns.countplot(data=loans_copy, x='BorrowerRate')
<AxesSubplot:xlabel='BorrowerRate', ylabel='count'>
loans_copy.BorrowerAPR.value_counts()
0.35797 3672
0.35643 1644
0.37453 1260
0.30532 902
0.29510 747
...
0.37266 1
0.27518 1
0.18477 1
0.29961 1
0.19543 1
Name: BorrowerAPR, Length: 6677, dtype: int64
# Change data type of APR to float
loans_copy.BorrowerAPR = loans_copy.BorrowerAPR.astype('float')
sns.countplot(data=loans_copy, x='BorrowerAPR')
<AxesSubplot:xlabel='BorrowerAPR', ylabel='count'>
loans_copy.ProsperScore.value_counts()
4.0 12595 6.0 12278 8.0 12053 7.0 10597 5.0 9813 3.0 7642 9.0 6911 2.0 5766 10.0 4750 11.0 1456 1.0 992 Name: ProsperScore, dtype: int64
#change data type to category
loans_copy.ProsperScore = loans_copy.ProsperScore.astype('category')
loans_copy.ProsperScore.dtype
CategoricalDtype(categories=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0], ordered=False)
sns.countplot(data=loans_copy, x='ProsperScore')
<AxesSubplot:xlabel='ProsperScore', ylabel='count'>
loans_copy.BorrowerState.value_counts()
CA 14717 TX 6842 NY 6729 FL 6720 IL 5921 GA 5008 OH 4197 MI 3593 VA 3278 NJ 3097 NC 3084 WA 3048 PA 2972 MD 2821 MO 2615 MN 2318 MA 2242 CO 2210 IN 2078 AZ 1901 WI 1842 OR 1817 TN 1737 AL 1679 CT 1627 SC 1122 NV 1090 KS 1062 KY 983 OK 971 LA 954 UT 877 AR 855 MS 787 NE 674 ID 599 NH 551 NM 472 RI 435 HI 409 WV 391 DC 382 MT 330 DE 300 VT 207 AK 200 SD 189 IA 186 WY 150 ME 101 ND 52 Name: BorrowerState, dtype: int64
# Find number of null values present in state abbreviations
loans_copy.BorrowerState.isnull().sum()
5515
# Percentage of missing states
loans_copy.BorrowerState.isnull().mean() * 100
4.840394252964358
# Replace the null values with the phrase "Not Specified" and convert the column to category data type
loans_copy.BorrowerState.fillna('Not Specified', inplace = True)
loans_copy.BorrowerState = loans_copy.BorrowerState.astype('category')
loans_copy.BorrowerState.dtype
CategoricalDtype(categories=['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL',
'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA',
'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE',
'NH', 'NJ', 'NM', 'NV', 'NY', 'Not Specified', 'OH', 'OK',
'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT',
'WA', 'WI', 'WV', 'WY'],
, ordered=False)
# Replace the abbreviations with names for clarity using a dictonary
# First, create a dictionary of the names you would like to create
full_state_names = {'WA': 'WASHINGTON', 'VA': 'VIRGINIA', 'DE': 'DELAWARE', 'DC': 'DISTRICT OF COLUMBIA', 'WI': 'WISCONSIN',
'WV': 'WEST VIRGINIA','HI': 'HAWAII', 'FL': 'FLORIDA', 'WY': 'WYOMING', 'NH': 'NEW HAMPSHIRE',
'NJ': 'NEW JERSEY', 'NM': 'NEW MEXICO','TX': 'TEXAS', 'LA': 'LOUISIANA', 'NC': 'NORTH CAROLINA',
'ND': 'NORTH DAKOTA', 'NE': 'NEBRASKA', 'TN': 'TENNESSEE','NY': 'NEW YORK', 'PA': 'PENNSYLVANIA',
'CA': 'CALIFORNIA','NV': 'NEVADA', 'CO': 'COLORADO', 'VI': 'VIRGIN ISLANDS', 'AK': 'ALASKA',
'AL': 'ALABAMA', 'AR': 'ARKANSAS', 'VT': 'VERMONT', 'IL': 'ILLINOIS', 'GA': 'GEORGIA', 'IN': 'INDIANA',
'IA': 'IOWA', 'OK': 'OKLAHOMA', 'AZ': 'ARIZONA', 'ID': 'IDAHO', 'CT': 'CONNECTICUT', 'ME': 'MAINE',
'MD': 'MARYLAND', 'MA': 'MASSACHUSETTS','OH': 'OHIO', 'UT': 'UTAH', 'MO': 'MISSOURI','MN': 'MINNESOTA',
'MI': 'MICHIGAN', 'RI': 'RHODE ISLAND','KS': 'KANSAS', 'MT': 'MONTANA', 'MS': 'MISSISSIPPI',
'SC': 'SOUTH CAROLINA', 'KY': 'KENTUCKY', 'OR': 'OREGON', 'SD': 'SOUTH DAKOTA',
'Not Specified': 'NOT SPECIFIED' }
loans_copy.BorrowerState = loans_copy.BorrowerState.map(full_state_names)
loans_copy.BorrowerState.value_counts()
CALIFORNIA 14717 TEXAS 6842 NEW YORK 6729 FLORIDA 6720 ILLINOIS 5921 NOT SPECIFIED 5515 GEORGIA 5008 OHIO 4197 MICHIGAN 3593 VIRGINIA 3278 NEW JERSEY 3097 NORTH CAROLINA 3084 WASHINGTON 3048 PENNSYLVANIA 2972 MARYLAND 2821 MISSOURI 2615 MINNESOTA 2318 MASSACHUSETTS 2242 COLORADO 2210 INDIANA 2078 ARIZONA 1901 WISCONSIN 1842 OREGON 1817 TENNESSEE 1737 ALABAMA 1679 CONNECTICUT 1627 SOUTH CAROLINA 1122 NEVADA 1090 KANSAS 1062 KENTUCKY 983 OKLAHOMA 971 LOUISIANA 954 UTAH 877 ARKANSAS 855 MISSISSIPPI 787 NEBRASKA 674 IDAHO 599 NEW HAMPSHIRE 551 NEW MEXICO 472 RHODE ISLAND 435 HAWAII 409 WEST VIRGINIA 391 DISTRICT OF COLUMBIA 382 MONTANA 330 DELAWARE 300 VERMONT 207 ALASKA 200 SOUTH DAKOTA 189 IOWA 186 WYOMING 150 MAINE 101 NORTH DAKOTA 52 Name: BorrowerState, dtype: int64
type_order = loans_copy.BorrowerState.value_counts().index
plt.figure(figsize=[20,5])
sns.countplot(data=loans_copy, x='BorrowerState', order=type_order)
plt.xticks(rotation=90);
loans_copy.Occupation.value_counts()
Other 28617
Professional 13628
Computer Programmer 4478
Executive 4311
Teacher 3759
...
Dentist 68
Student - College Freshman 41
Student - Community College 28
Judge 22
Student - Technical School 16
Name: Occupation, Length: 67, dtype: int64
# change data type to categorical
loans_copy.Occupation = loans_copy.Occupation.astype('category')
loans_copy.Occupation.dtype
CategoricalDtype(categories=['Accountant/CPA', 'Administrative Assistant', 'Analyst',
'Architect', 'Attorney', 'Biologist', 'Bus Driver',
'Car Dealer', 'Chemist', 'Civil Service', 'Clergy',
'Clerical', 'Computer Programmer', 'Construction', 'Dentist',
'Doctor', 'Engineer - Chemical', 'Engineer - Electrical',
'Engineer - Mechanical', 'Executive', 'Fireman',
'Flight Attendant', 'Food Service',
'Food Service Management', 'Homemaker', 'Investor', 'Judge',
'Laborer', 'Landscaping', 'Medical Technician',
'Military Enlisted', 'Military Officer', 'Nurse (LPN)',
'Nurse (RN)', 'Nurse's Aide', 'Other', 'Pharmacist',
'Pilot - Private/Commercial',
'Police Officer/Correction Officer', 'Postal Service',
'Principal', 'Professional', 'Professor', 'Psychologist',
'Realtor', 'Religious', 'Retail Management',
'Sales - Commission', 'Sales - Retail', 'Scientist',
'Skilled Labor', 'Social Worker',
'Student - College Freshman',
'Student - College Graduate Student',
'Student - College Junior', 'Student - College Senior',
'Student - College Sophomore', 'Student - Community College',
'Student - Technical School', 'Teacher', 'Teacher's Aide',
'Tradesman - Carpenter', 'Tradesman - Electrician',
'Tradesman - Mechanic', 'Tradesman - Plumber',
'Truck Driver', 'Waiter/Waitress'],
, ordered=False)
loans_copy.EmploymentStatus.value_counts()
Employed 67322 Full-time 26355 Self-employed 6134 Not available 5347 Other 3806 Part-time 1088 Not employed 835 Retired 795 Name: EmploymentStatus, dtype: int64
# Change data type to category
loans_copy.EmploymentStatus = loans_copy.EmploymentStatus.astype('category')
type_order = loans_copy.EmploymentStatus.value_counts().index
sns.countplot(data=loans_copy, y='EmploymentStatus', order = type_order)
<AxesSubplot:xlabel='count', ylabel='EmploymentStatus'>
loans_copy.IncomeRange.value_counts()
$25,000-49,999 32192 $50,000-74,999 31050 $100,000+ 17337 $75,000-99,999 16916 Not displayed 7741 $1-24,999 7274 Not employed 806 $0 621 Name: IncomeRange, dtype: int64
loans_copy.IncomeRange.describe()
count 113937 unique 8 top $25,000-49,999 freq 32192 Name: IncomeRange, dtype: object
type_order = loans_copy.IncomeRange.value_counts().index
sns.countplot(data=loans_copy, y='IncomeRange', order=type_order, color='blue')
<AxesSubplot:xlabel='count', ylabel='IncomeRange'>
loans_copy.IncomeVerifiable.value_counts()
True 105268 False 8669 Name: IncomeVerifiable, dtype: int64
loans_copy.IncomeVerifiable.dtype
dtype('bool')
sns.countplot(data=loans_copy,x='IncomeVerifiable')
<AxesSubplot:xlabel='IncomeVerifiable', ylabel='count'>
# Structure of the dataframe
loans_copy.shape
(113937, 81)
loans_copy.IsBorrowerHomeowner.value_counts()
True 57478 False 56459 Name: IsBorrowerHomeowner, dtype: int64
loans_copy.IsBorrowerHomeowner.dtype
dtype('bool')
sns.countplot(data=loans_copy, x='IsBorrowerHomeowner')
<AxesSubplot:xlabel='IsBorrowerHomeowner', ylabel='count'>
sns.countplot(data=loans_copy, x='ProsperScore')
<AxesSubplot:xlabel='ProsperScore', ylabel='count'>
The dataset structure is made up of 113937 rows and 81 columns
The main features of interest are the loan terms, the loan status, the interest and APR and what factors determine the allocation of loans as well as payment of loans.
The features in the dataset that will help support the investigation are the loan terms, the loan status, the credit grade, the interest and APR, the borrower state, the occupation of the borrower, the employment status, the income range and the verifiable income of the borrower.
In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.
# Create a function for drawing of countplots
def draw_countplot(x, title):
plt.figure(figsize=(10,5))
visual = sns.countplot(x=x,data=loans_copy, color='blue')
for txt in visual.patches:
visual.annotate('{:.0f}'.format(txt.get_height()), (txt.get_x()+0, txt.get_height()+0.05))
plt.suptitle(title)
plt.xticks(rotation=90)
plt.plot()
plt.show()
draw_countplot('Term','Plot Showing the Distribution of Loan Terms')
From the plot above, it seems medium term loans are the most disbursed loans followed by the long term loans. Without ordering the bars in the plot, the distribution is unimodal and the distribution appears to be normal that is, not skewed to either direction
loans_copy.IncomeRange.value_counts()
$25,000-49,999 32192 $50,000-74,999 31050 $100,000+ 17337 $75,000-99,999 16916 Not displayed 7741 $1-24,999 7274 Not employed 806 $0 621 Name: IncomeRange, dtype: int64
type_order = loans_copy.IncomeRange.value_counts().index
sns.countplot(data=loans_copy, y='IncomeRange', order=type_order, color='blue');
plt.title(' Distribution of Income Range');
From the plot above, it is obvious that most of the loans fall within the income range of 25,000 - 49,999 followed by the 50,000 - 74,999 income range. It would be safe to say most of the loan applicants are middle-income earners.
loans_copy.EmploymentStatus.value_counts()
Employed 67322 Full-time 26355 Self-employed 6134 Not available 5347 Other 3806 Part-time 1088 Not employed 835 Retired 795 Name: EmploymentStatus, dtype: int64
type_order = loans_copy.EmploymentStatus.value_counts().index
sns.countplot(data=loans_copy, y='EmploymentStatus', order = type_order, color='blue')
plt.title('Plot Showing Distribution of Employment Status');
The plot above shows that most of the loan applicants are employed. This should make sense because one of the terms for eligibility for a loan should be a means of paying the loan which is having some form of employment
draw_countplot('LoanStatus','Plot Showing the Distribution of Loan Status')
From the plot above, it is obvious that most of the loans are still being currently serviced. However, the next highest metric shows that 38,074 loans have been completed. It may indicate a strong possibility that most of the loans that are current would be completed.
loans_copy.BorrowerState.value_counts()
CALIFORNIA 14717 TEXAS 6842 NEW YORK 6729 FLORIDA 6720 ILLINOIS 5921 NOT SPECIFIED 5515 GEORGIA 5008 OHIO 4197 MICHIGAN 3593 VIRGINIA 3278 NEW JERSEY 3097 NORTH CAROLINA 3084 WASHINGTON 3048 PENNSYLVANIA 2972 MARYLAND 2821 MISSOURI 2615 MINNESOTA 2318 MASSACHUSETTS 2242 COLORADO 2210 INDIANA 2078 ARIZONA 1901 WISCONSIN 1842 OREGON 1817 TENNESSEE 1737 ALABAMA 1679 CONNECTICUT 1627 SOUTH CAROLINA 1122 NEVADA 1090 KANSAS 1062 KENTUCKY 983 OKLAHOMA 971 LOUISIANA 954 UTAH 877 ARKANSAS 855 MISSISSIPPI 787 NEBRASKA 674 IDAHO 599 NEW HAMPSHIRE 551 NEW MEXICO 472 RHODE ISLAND 435 HAWAII 409 WEST VIRGINIA 391 DISTRICT OF COLUMBIA 382 MONTANA 330 DELAWARE 300 VERMONT 207 ALASKA 200 SOUTH DAKOTA 189 IOWA 186 WYOMING 150 MAINE 101 NORTH DAKOTA 52 Name: BorrowerState, dtype: int64
type_order = loans_copy.BorrowerState.value_counts().index
plt.figure(figsize=[20,5])
sns.countplot(data=loans_copy, x='BorrowerState', order=type_order)
plt.xticks(rotation=90);
plt.title('Plot showing Distribution of Borrowers States');
The plot above shows that most of the lenders are from the state of California with Texas a distant second. This could be indicative of spending practices or the income levels of the people of these states. More detail on this could be gotten from subsequent bivariate oe even multivariate plots
loans_copy.Occupation.value_counts()
Other 28617
Professional 13628
Computer Programmer 4478
Executive 4311
Teacher 3759
...
Dentist 68
Student - College Freshman 41
Student - Community College 28
Judge 22
Student - Technical School 16
Name: Occupation, Length: 67, dtype: int64
type_order = loans_copy.Occupation.value_counts().index
plt.figure(figsize=[20,5])
sns.countplot(data=loans_copy, x='Occupation', order=type_order)
plt.xticks(rotation=90);
The plot above shows that most of the loan applicants have their occupations defined as Other. In other words, their occupation was not captured in the list of occupations. It is highly probable that these people are business owners as there is a nothing to account for business owners in this list
plt.figure(figsize=[20,5])
sns.countplot(data=loans_copy, x='CreditGrade', order=['A','E','C','AA','B','D','HR'], color='blue')
plt.xticks(rotation=90);
plt.title('Plot showing Distribution of Credit Grades');
loans_copy.CreditGrade.value_counts()
C 5649 D 5153 B 4389 AA 3509 HR 3508 A 3315 E 3289 NC 141 Name: CreditGrade, dtype: int64
The plot shows that most of the credit grades of the loans are actually of rating C followed by D. It also shows that there is a normal distribution of the credit grades of the borrowers and it is unimodal with Credit Grade C being the highest
draw_countplot('IncomeVerifiable','Plot Showing the Distribution of the Verifiable Income of Borrowers')
The plot above shows that most of the borrowers (105,268) had a verifiable source of income. We will see if this plays a role in the granting of loans
draw_countplot('ProsperScore','Plot Showing the Distribution of Prosper Score Ratings')
The plot shows a normal distribution of the ProsperScore rating. Also, it seems that the highest Prosper Score ratings are 4.0, 6.0 and 8.0.
From the plots, a lot of insights were gotten. Firstly, for the loan terms, for the sake of clarity, I had to convert the terms, 12 months, 36 months and 60 months into short, medium and long term respectively. Then this was converted to categorical data type and from the plot, it can be seen that the distribution is unimodal with medium term loans having the highest number 0f 87778 borrowers.
For the income range, it was interesting to see that most of the borrowers fall withing the 25,000-49,999 range which makes up 32,192 borrowers followed by 50,000-74,999 which makes up 31,050. The data had to be converted to the categorical type for easier analysis.
Also, most of the borrowers seem to be gainfully employed with 67,322 of all the borrowers being employed. This accounts for 60 % of all the borrowers. The arrangement of the bars in the chart have been changed so that they can be shown in descending order. This helps in bringing more clarity to the plot.
The loan status plot shows that majority of the loans are being currently serviced. Before, plotting the chart for this, first I had to merge all past due loans under one category for easier analysis. Then, I had to convert the data type to the category data type for easier analysis. The plot shows 56,576 borrowers are currently servicing their loans with 38,074 of the loans being paid off completely. Those two metrics are an indicator that most loans from Prosper Loan will be paid regularly and completed.
For the borrower's state, firstly I had to change the state abbreviations to full names for more clarity. Then, that column was converted to the categorical data type. From the plot, it is obvious that most of the borrowers are from the state of California which accounts for 13% of all lenders alone. Texas comes a distant second with 6% of all lenders.
Then, for the Prosper Score rating, which is a measure of the risk level of the borrower. First, this had to be converted to categorical data type and then plotted. Since this was an ordered data set, then we did not need to arrange it in ascending or descending order. The plot shows a normal distribution of the Prosper Score rating with the ratings, 4, 6 and 8 having the highest number of borrowers.
In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).
def bivariate_count_plot(x, hue, title):
plt.figure(figsize=(15,10))
ax = sns.countplot(x = x, hue = hue, data =loans_copy)
for p in ax.patches:
ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.05))
plt.suptitle(title)
plt.xticks(rotation = 90)
ax.margins(y=0.1)
plt.tight_layout()
plt.plot()
plt.show();
bivariate_count_plot('ProsperScore', 'Term', 'Plot of Prosper SCore against Loan Term')
From the clustered bar chart above, it would seem that medium term loans are in the highest demand, irrespective of the Prosper Score. It depicts that irrespective of the level of risk a borrower poses, borrowers seem more inclined towards medium term loans.
bivariate_count_plot('ProsperScore', 'IncomeRange', 'Plot of Prosper Score against Income Range')
From the plot, it appears that as the Prosper Score increases, there is also a commensurate increase in the income ranges across all income ranges. Especially when we look at Propser Scores from 4.0 to 8.0, this is more evident. This is an indication that income plays a key role in determining the risk level of the borrowers
bivariate_count_plot('LoanStatus', 'IncomeVerifiable', 'Plot of Loan Status against Verifiable Income')
From the plot above, it shows that having a verifiable income pays in a key role in determining the status of a loan. For instance, we can see that most of those with a verifiable source of income have either current loans or have completed the the payment of their loans. Even though, other loan status' also have borrowers with a verifiable source of income, these two variables can help the bank in determining the type of borrowers to lan money to
bivariate_count_plot('EmploymentStatus', 'IncomeVerifiable', 'Plot of Employment Status against Verifiable Income')
The plot shows that employment status is a key indicator of having verifiable income. Also, being gainfully employed seems related to having a verifiable source of income. Case in point, the muber of people employed and having full-time employment seem to have the highest numbers, 65902 and 25552. Also, it shows at the tail of the graph that most of the borrowers that are self-employed do not have a verifiable source of income
bivariate_count_plot('EmploymentStatus', 'IncomeRange', 'Plot of Employment Status against Income Range')
bivariate_count_plot('EmploymentStatus', 'IncomeRange', 'Plot of Employment Status against Income Range')
Looking at this plot of employment status and income range, it shows that employed and full time employees that applied for loans fall mostly within the 25,000 - 75,000 income range. This shows that most ot the borrowers are employed and earn between 25,000 and 75,000. This could help in knowing the demographics of the borrowers
Firstly, a plot of the Prosper Score along with the Loan terms was done using a clustered bar chart. From the charts, we can see that most people, irrespective of their Prosper Score, seemed more inclined towards medium term loans. This was a bit surprising because I would expect that people with a lower risk score, who would be more financially buoyant, would like to borrow more long term loans while higher risk borrowers would gravitate towards short term loans.
For the plot of Income Range and Prosper Score, it can be seen that from Prosper Score 1 to 5, there seem to be more borrowers within the 25,000 to 49,900 USD range. However, from Prosper Score 6 t0 9, it seems that most of the borrowers fall within the 50,000 to 75,000 USD and as we move more towards the highest Prosper score, we have people more in the 100,000+ income range. This points out that the higher your income range, then the better your Prosper Score.
For the plot of Verifiable Income against Loan Status, it shows that irrespective of your loan status, having a verifiable source of income goes a long way in making sure you get the loan in the first place since all the loan status' have a higher number of borrowers with verifiable income compared to borrowers with no verifiable income.
For the plot of the Employment Status and Verifiable Income, it shows that being employed goes a long way in determining if you have a verifiable source of income and increases your chances of getting a loan. However, for those unemployed, unsurprisingly, there are more borrowers with an unverifiable source of income. For those self-employed though, surprisingly, there are more borrowers with unverifiable income compared to verifiable income. I would expect that self employed people would track their business better and so their income can be verified.
For the plot of Employment Status and Income Range, it seems for the employed, full-time and self employed borrowers, most of their borrowers fall within the 25,000 to 49,900 USD range and 50,000 to 75,000 USD. However, part time and retired, their income range is from 1-24,999.
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
def draw_scatter(hue, title):
plt.figure(figsize=(15,10), dpi=400)
sns.scatterplot(x = 'LoanOriginalAmount', y = 'MonthlyLoanPayment', hue = hue, data = loans_copy)
plt.xlabel('Loan Original Amount (USD)')
plt.ylabel('Monthly Loan Payment (USD)')
plt.title(title);
draw_scatter('Term', 'Original Loan Amount Against Monthly Loan Payments by Loan Term.')
By looking at the plot, it shows a positive correlation between the loan amounts and the monthly payments . However, the slope of each term seems to get steeper as the term get shorter. For instance, we can see that for short term loans,the amount paid monthly seems higher compared to medium or long term loans. This may have something to do with the timeframe and the bank wanting to make its money back as quickly as possible within the short timeframe
draw_scatter('EmploymentStatus', 'Original Loan Amount Against Employment Status.')
Looking through the plot, we can see most of the borrowers are employed and as there is a positive correlation between the monthly loan payment and the original loan amount
Comparing the Original Loan Amount against the Monthly Payments for the 3 types of loan terms, it can be seen that there is a positive correlation between loan amount and monthly payments for all 3 loan terms. However, it seems more pronounced when we come to short term loans than any other loan terms. The reason for short term loans being more pronounced is not too clear. However, I would assume that it had more to do with the fact that due to the shortness of the duration of the loan, more money has to be paid monthly to meet up with the deadline
Comparing the Original Loan Amount against the Monthly Payments based on Employment Status, it can be seen that borrowers that were fully employed seemed to pay a higher amount monthly compared to their original loan amount. It in some ways relates to the previous plots that showed a high amount of people gravitating towards medium and short term loans. That is a bit surprising.
From the questions asked and the analysis and visualizations, it can be deduced that income range plays a key role in the granting of loans to prospective borrowers. The higher your income range, the lesser you pos e a risk to the bank, that is, your Prosper Score is higher. Also, employment status is a key determinant in selecting borrowers. Though, it seems in the dataset, there are a lot of borrowers that fall under the others employment status, it could be more defined subsequently to allow for better analysis.